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. We analyze Neuberger's double pass algorithm for the matrix-vector 

multiplication R{H) ■ Y (where R(H) is (n — l,n)-th degree rational 
polynomial of positive definite operator H), and show that the number 
of floating point operations is independent of the degree n, provided 
Qj", that the number of sites is much larger than the number of iterations 

'-^ in the conjugate gradient. This implies that the matrix- vector product 

can be approximated to very high precision 
with sufficiently large n, without noticeably extra costs. Further, we 
| show that there exists a threshold ut such that the double pass is faster 

than the single pass for n > ny, where ny ~ 12 — 25 for most platforms. 
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1 Introduction 



In 1998, Neuberger proposed the nested conjugate gradient [T] for solving the 
propagator of the overlap-Dirac operator |2j 



D = m 1 + 75 



with the sign function sgn(H u 
approximation 




(1) 



approximated by the polar 



S(H U 



n 



v 
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where H w = ^D W) and D w is the standard Wilson-Dirac operator plus a 
negative parameter — m (0 < m < 2), and the coefficients bi and di are: 
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In principle, any column vector of D 1 = D\DD^) 1 can be obtained by 
solving the system 



DD^Y = ml\2 + l5 S{H w ) + S(H w ) l5 }Y = 1 



(3) 



with conjugate gradient, provided that the matrix-vector product S(H W )Y can 
be carried out. Writing 



S(H W )Y = ^^ blZ U 



i=i 



one can obtain {Z^} by solving the system 



(4) 



(5) 



with multi-shift conjugate gradient [3 Ej. In other words, each (outer) CG 
iteration in Q contains a complete (inner) CG loop (jSJ), i.e., nested conjugate 
gradient. 

Evidently, the overhead for the nested conjugate gradient is the execution 
time for the inner conjugate gradient loop (0) as well as the memory space it 
requires, i.e., (2n + 3) large vectors, each of 12N site double complex numbers, 
where N S i te is the number of sites, and 12 = 3 (color) x 4 (Dirac) is the degrees 
of freedom at each site for QCD. The memory storage becomes prohibitive for 
large lattices since n is often required to be larger than 16 in order to achieve a 
reliable approximation for the sign function. To minimize the memory storage 
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for {PW, Z^}, Neuberger [S] observed that one only needs the linear combina- 
tion H]=i biZ^ rather than each Z^> individually. Since zf' and its conjugate 
vector pf at the z-th iteration of the inner CG are linear combinations of 
their preceedents {Pf \ zf\j — 0, • • • , i— 1} in the iteration process, thus it is 
possible to obtain their updating coefficients {oq\pj; } in the first pass, and 
then use them to update the sum J2?=i hZ^ successively in the second pass, 
with memory storage of only 5 vectors, independent of the degree n of the 
rational polynomial PS n ~ l,n \ 

At first sight, the double pass algorithm seems to be slower than the single 
pass algorithm. However, in the test run (with SU(2) gauge field on the 8 3 
lattice), Neuberger found that the double pass actually ran faster by 30% than 
the single pass, and remarked that the speedup most likely reflects the cache 
usage in the testing platform, the SGI O2000 (with four processors, each with 
4MB cache memory). 

In this paper, we analyze the number of floating point operations (F 2 ) for 
the double pass algorithm, and show that it is independent of the degree n of 
PS n ~ 1,n \ provided that the number of lattice sites (N S i te ) is much larger than 
the number of iterations (Li) of the CG loop. The last condition is amply 
satisfied even for a small lattice (e.g., N s u e = 8 3 x 24 = 12288), since Li is 
usually less than 1000 (after the low-lying eigenmodes of are projected 
out). On the other hand, the number of floating point operations (P\) for 
the single pass is a linearly increasing function of n. It follows that there 
exists a threshold np such that F 2 < F± for n > np, where the value of rip 
depends on the implementation of the algorithms {yip ~ 59 for our codes). 
Corresponding to the number of floating point operations, we also obtain the 
formulas for the CPU times (T\ and T 2 ) for the single and the double pass 
algorithms. Further, we show that there exists a threshold rip such that the 
double pass is faster than the single pass 1 for n > np, where np ~ 12 — 25 (for 
most platforms), which is quite smaller than the threshold np ~ 59 for the 
number of floating point operations. By timing the speed of each subroutine, 
we can account for the extra slow down in the single pass algorithm, which is 
unlikely to be eliminated, due to the memory bandwidth, a generic weakness 
of any computational system. Thus, in general (for most vector or superscalar 
machines), one may find that the double pass is faster than the single pass, for 
n > np ~ 12 — 25. This explains why in Neuberger's test run, even at n = 32, 
the double pass is already faster than the single pass by 30%. In fact, we find 
that DEC alpha XP1000 and IBM SP2 SMP also have 30% speedup at n = 32 
(see Table EJ), which agrees with the theoretical estimate (J32J) using the CPU 
time formulas (1271) -(I2HD. 

Nevertheless, the most interesting result is that the speed of the double pass 

1 In this paper, we only consider the (faster) single pass algorithm in which the vectors 
PW and ZW (I = 2, • • • , n) are not updated after converges. 
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algorithm is almost independent of the degree n. This implies that the matrix- 
vector product (#2)-i/2y ^ R (n-l,n) (#2)y 

can be approximated to very high 
precision with sufficiently large n, without noticeably extra costs. 

The outline of this paper is as follows. In section 2, we outline the single 
and double pass algorithms for the iteration of the (inner) CG loop (J3J), and 
analyze their major differences. In section 3, we estimate the number of (double 
precision) floating point operations as well as the CPU time for the single and 
double pass algorithms respectively, and show that there exists a threshold ut 
such that the double pass is faster than the single pass for n > ut- In section 
4, we perform some tests. In section 5, we conclude with some remarks. 



2 The single and double pass algorithms 

In the section, we outline the single and the double pass algorithms for the 
inner CG loop ©, and point out their major differences. 

For the single pass algorithm, with the input vector Y, we initialize the 
vector variables {Z®, P®}, R, A, B and the scalar variables a, (3, { 7 ^} as 

4° = o, p (0 = r, z = i,...,n 

R = Y, 
a-i = 1, 
(3o = 0, 

7 P)= 7 J) = 1, / = !,..., n 



Then we iterate (j = 0, 1, • • •) according to: 



A, 


= H P (1) 


Bj 


= H w A j + d 1 P^ = 
( R j\ R j) 


a d 


- (P^B,) 


R j+ i 


= Rj - ajBj 


Pj+i 


(R j+1 \R j+ i) 
(Rj\Rj) 


P (i) 


= Rj+i + Pj+iPj 


z i+i 


= ^+0^ 



(6) 

(H 2 W + (7) 

(8) 

(9) 
(10) 

(11) 
(12) 



together with the following updates for I = 2, • • • , n: 

(0 (0 

Jj+1 a r ( l ) (0\ , (0 (-1 1 1 a a \\ K ' 
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p(J) 



7 (Q 



(0 D 



+ /3. 



'i+i 



(0 

Tj+i ] p (i) 
7<" ' ' 



(0 



, ? (0 i n . p(0 
7i 



(14) 
(15) 



The loop terminates at the z-th iteration if (R i+ i\R i+ i) / (Y\Y) is less than 
the tolerance (tol). 

Since we are only interested in the linear combination Ya=\ bi^i+ii m which 
each zf^ can be expressed in terms of {Rj,j = 0, ■ ■ ■ ,i}, thus we can write 



(16) 



i=i 



3=0 



where Cj can be derived as 



C 3 = E 



*-3 

E 

m=0 



a j+m S m Ui + E 6 



"m+j'+l Im+j 



1=2 7 



(0 



(17) 



with 



^ = 1^^' T m> °> (18) 
| 1 , for m = 0. v 1 

Therefore, the r.h.s. of ([TfiJ) can be evaluated with the CG loop (|5Jl- (fTTJl . requir- 
ing only the storage of 5 large vectors (A, B, R, P^\T = Y^CjRj), provided 
that the coefficients {cj,j = 0, are known. However, from (|17|). the 

determination of Cj at any j-th iteration requires some values of {a}, {(3} and 
{7} which can only be obtained in later iterations. Thus we have to run the 
first pass, i.e., the CG loop (fB])-(fTT|). to obtain all coefficients of {a}, {[3} up 
to the convergence point i, and then compute all {cj,j = 0, • ■ • ,i} according 
to (jT7jl and (JT3J). Finally, we run the second pass, i.e., going through (jUJ), (J7j), 
©, (fTTj) . and adding CjRj to the r.h.s. of (fHjj) . successively from j — to the 
convergence point i. 

Evidently, all operations in (J5j)- (|T2j) and (|T4 ]l -([T3 |l are proportional to the 
number of lattice sites N S i te times the number of iterations Lj. On the other 
hand, the computations of the coefficients {7} (fTTJ|) and {c, } (fT7|) do not depend 
on N site , but only on Lj (up to a small term proportional to Lf). Thus, for 
N si te 3> Li, we can neglect the computation of {cj} (fT7j) . and focus on the 
major difference between the single pass and the double pass, namely, the 
number of operations in (|14 p -([15 p . which is proportional to (n — l)N S i te Li, 
versus the number of operations in (jUJ), (JZJ), ©, (jTTjl plus the vector update 
on the r.h.s. of (|16j) . which is proportional to N site Li. Obviously, the number 
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of floating-point operations in the single pass is a linearly increasing function 
of n, while that of the double pass is independent of n, thus it follows that the 
double pass must be faster than the single pass for sufficiently large n. 

In the next section, we estimate the number of floating point operations 
as well as the CPU time, for the single pass and the double pass respectively. 
Even though our countings are based on our codes, they serve to illustrate the 
general features of the single and the double pass algorithms, which are valid 
for any software implementations and/or machines. 



3 The CPU time and the number of floating 
point operations 

For our codes, the number of floating point operations for the single pass is 

F 1 = AU^[3552 + 120(n - l)p) + N slte (288n ev + A8n + 1776) , (19) 
while for the double pass is 



6888 N site Li + N slte {288n ev - 1656) 



+ 



6 



+ Lj(2n- 1) + L t [13n 



73 
~6~ 



7n + 7 



Q 



(20) 



where N S i te is the number of sites of the lattice, Lj is the number of iterations 
of the CG loop, n ev is the number of projected eigenmodes of H^, and n is 
the degree of the rational polynomial PS n ~ l,n \ In the single pass, Eq. (|19|). 
in — l)p is the effective number of the {n — 1) updates in (fn|) - (fTB]) . since 
and Z® are not updated after Z® converges. The value of p depends 
on the convergence criteria as well as the rational polynomial _R( n-1 >™) and its 
argument. Similarly, in the double pass, the sum in ()17j) only includes the 
terms which have not yet converged at the iteration j, and the reduction in 
the number of floating point operations can be taken into account by the factor 
q in (|20j) . (The value of q is about 0.95 for convergence up to the zero in the 
IEEE double precision representation). 

Taking into account of different speeds of various floating point operations, 
we estimate the CPU time for the single pass and the double pass as follows. 



Ti = N S it e Li [192*6 + 72t c + 3288t d + (A8t h + 72t a ) (n - 
+N stte (288n ev t e + A8nt b + 24t a + 108t c + 1644t d ) 

T 2 = N slte L t {2A0t b + 72t c + Q576t d ) 

+N stte (288n ev t e + 24t a - U4t b + 108t c - 1644t d ) 



l)p] 



+q 



1) + Li( 13n 



73 \ 
~6J 



7n + 7 



(21) 



(22) 
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where t a ,tb,t c and td denote the average CPU time per floating-point oper- 
ation (FPO) for the four different kinds of vector operations (a)-(d) listed 
in Table Q respectively, t e the average CPU time per FPO for constructing 
the complementary vector from the projected eigenmodes of H^, and tf the 
time for computing the coefficients (fT7jl in the double pass. Note that setting 
t a = t b = t e = t d = t e = t f = 1 in (J2U) and flUJ) reproduce (O and pijl 
respectively. 

It should be emphasized that the numerical values of the constants and 
coefficients in Alty -^Jify ) may vary slightly from one implementation to another, 
however, the number of different terms and their functional dependences on the 
variables (N site , Li, n, n ev , p, q, t a ,tb,t c ,td,t e , andtf) should be the same for 
any codes of the single and double pass algorithms. 

For the double pass, it is clear that the first term on the r.h.s. of Eq. (|2Ujl 
is the most significant part, since the number of lattice site (N site ) is usually 
much larger than the number of iterations (L») of the CG loop such that the 
second and the third terms on the r.h.s. of (}2T)j) can be neglected. For example, 
N s ue = 8 3 x 24, Li = 1000, n = 16, n ev = 32 and q = 0.95, then the first term 
is 6888N site Li ~ 8.5 x 10 10 , while the sum of the second and the third terms 
only gives ~ 2.8 x 10 8 . Thus we can single out the most significant part of F 2 , 

F 2 ~ 6888 N S iteLi , (23) 

which comes from the first pass (|5|)-(|TT|) and the second pass ©, ©, ©? (HD) 
plus the vector update on the r.h.s. of ([TB^I. Similarly, for the single pass, the 
most significant part of F± is the first term on the r.h.s. of (|T9"|) 

F 1 ~ N site Li[?>hh2 + 120(n - l)p] , (24) 

which comes from the operations in (|B )) -([12) ) . and (|14p - (jl5|) . 

Evidently, from (|24p and (|23j) . F\ is a linearly increasing function of n while 
F 2 is independent of n. Thus it follows that there exists a threshold np such 
that F 2 < Fi for n > uf- From (|2^j) and (J2~3~j) . we obtain the threshold 

139 . . 

n F = 1 + -=- , 25 
op 

where the value of p depends on the convergence criterion for removing 
from the updating list, as well as the rational polynomial i?( n_1 ' n ) and its ar- 
gument. For our codes and the tests in the next section, p ~ 0.48, thus we 
have 

n F ~ 59 . (26) 

Assuming N site 3> L iy we obtain the most significant parts of the CPU 
times (PU)-© as 

Ti ~ iV flite L i [192t b + 72t c + 3288t (i + (48t 6 + 72t a )(n-l)p] , (27) 
T 2 ~ N site Li(240t b + 72t c + Qo7Qt d ) . (28) 
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Obviously, from (|27jl and (pSjl. there exists a threshold 



= 1 + (^Tso; (29) 

such that T 2 < T x (the double pass is faster than the single pass) for n> n T . 

Even though the countings in (|21|) and (|22|) are based on our codes (for 
with argument i/^,), the essential features of (|2*T|) and (|2*^j) should be 
common to all implementaions of the single and the double pass algorithms. 
In other words, the numerical coefficients in (|2*7jl and (|28|) may change from 
one implementaion to another, however, the existence of a threshold rir must 
hold for any implementation. 

Now it is interesting to compare with np. From (|2*3jl and (J2*9*|) . one 
immediately sees that n T < n F if 

19t d < llt a + 7t fe (30) 

is satisfied 2 . 

In practice, it turns out that t a /t d > 2 and t b /td > 3 for most systems 
(Tables ^HJ). Thus, ny ~ 12 — 25, which is quite smaller than np ~ 59. 

The speedup of the double pass with respect to the single pass (for n > tit) 
can be defined as 

5 = ^ (31) 

J-2 

which is estimated to be 

(3t a + 2t b )p 

d~ [n — nr) 32 

10t b + 3t c + 274*/ T> 1 ; 

where Eqs. ()27| ) -()29| ) have been used. 

In Tabled we hst our measurements of t a ,t b ,t c and td for four differ- 
ent hardware configurations of Pentium 4, i.e., two different Rambuses of 
faster/slower (PC1066/PC800) speed, and with/without SSE2 (the vector pro- 
cessing unit of Pentium 4) codes. 

Substituting the values of t a , t b and td into (|2*9*J) . we obtain the theoretical 
estimates for the threshold n?, 

Pentium 4, PC800, with SSE2, 
_ , __. Pentium 4, PC800, , , 

n T - \ i q Pentium 4) PC1066, with SSE2, ^ ' 

Pentium 4, PC1066, 




where p ~ 0.48 has been used. 

2 Note that the inequality (|30|) is more restrictive than 685^ < 417i a + 268t b 
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Table 1: The average CPU time (in unit of nano-second) per floating point 
operation (FPO) for 4 different kinds of matrix-vector operations in the single 
and double pass algorithms. The CPU is Pentium 4 (2.53 GHz), with one 
Gbyte Rambus (PC800 or PC1066). 





Operation 


# of FPO 


CPU time(ns) per FPO 








PC800 


PC1066 








SSE2 on off 


SSE2 on off 


(a) 


\A) = Cl \A) + c 2 \B) 


72 x N site 


3.720 3.721 


2.977 3.016 


(b) 


\V) = \A) + c\B) 


48 x N slte 


5.521 5.522 


4.330 4.429 


(c) 


a = (V\V) 


36 x N site 


4.249 4.251 


3.312 3.340 


(d) 


\A) = H W \B) 


1644 x N site 


0.764 1.535 


0.686 1.440 



Note that for each hardware configuration in Table ^ the average CPU 
time per FPO of the simple vector operations (a)-(c) is much longer than that 
of (d), Wilson matrix times vector. A simplified explanation 3 is as follows. 
Since all these four vector operations involve long vectors, the CPU and its 
cache cannot hold all data at once. Thus it is necessary to transfer the data 
from/to the memory successively, every time after the CPU completes its op- 
erations on a portion of the vectors. However, for any system, the memory 
bandwidth is limited. Thus, there is a time interval between consecutive sets 
of data transferring to/from the CPU. Therefore, if the CPU finishes a com- 
putation before the next set of data is ready, then it would waste its cycles in 
idling. Since any one of the vector operations (a)-(c) is rather simple, the CPU 
finishes a computation at a speed faster than that of transferring data from/to 
the memory, thus the CPU ends up wasting a significant fraction of time in 
idling. On the other hand, for the vector operation (d), the number of FPO is 
much more than that of any one of (a)-(c), thus when the CPU completes its 
operations on a portion of the vectors, the next set of data might have been 
ready, so the CPU does not waste much time in the memory I/O. This explains 
why the average CPU time per FPO of (a)-(c) is much longer than that of (d). 
Further, this simple picture also explains why turning on SSE2 of Pentium 4 
(see Table IH) doubles the speed of (d) but has no speedups for (a)-(c), since 
the bottleneck of (a)-(c) is essentially due to the memory bandwidth rather 
than the speed of the CPU. 

If the memory bandwidth is the major cause for the inefficiency of the 
simple vector operations (a)-(c), then using faster memories would increase 
the speeds of (a)-(c) more significantly than that of (d). From Table [U we can 
compare the speedups of these four vector operations as the (slower) PC800 is 

3 It should be emphasized that the mcchansim of the interactions between the CPU and 
the RAM is a rather complicated process, which is beyond the scope of this paper. 
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Table 2: Similar to Tabled except for the platforms: IBM SP2 SMP (Power 
3 at 375 Mhz) with 4 Gbyte memory, and DEC alpha XP1000 (21264A at 667 
Mhz) with 1.5 Gbyte memory. 





Operation 


# of FPO 


CPU time 


^ns) per FPO 








IBM 


DEC 


(a) 


\A) = Cl \A) + c 2 \B) 


72 x N site 


5.269 


7.232 


(b) 


\V) = \A) + c\B) 


48 x N site 


10.98 


12.91 


(c) 


a = (V\V) 


36 x N site 


6.209 


7.684 


(d) 


\A) = H W \B) 


1644 x N slte 


2.379 


3.054 



repalced with (faster) PC1066. We find that the speedup for (a)-(c) is 27%, 
but that for (d) is only 11%. Thus the speedups are consistent with above 
picture. 

Obviously, the inefficiency of vector operations (a)-(c) should exist in any 
platforms, not only for the Pentium 4 systems. To check this, we measure 
t a ,t b ,t e and t d for IBM SP2 SMP (Power 3 at 375 Mhz) and DEC alpha 
XP1000 (21264A at 667 Mhz) respectively. The results are listed in Table 
which give 

f 21, DEC alpha XP1000, . , 

nT ~ \ 20, IBM SP2 SMP. ^ > 

Although it is impossible to go through all platforms and measure the 
values of t a , t& and td individually, it is expected that t a /td > 1 and U/td > 1 
(such that the inequality (jHD|) is amply satisfied) is a common feature of most 
systems. In other words, we expect that the double pass is faster than the 
single pass for n > nx — 12 — 25, at least for most platforms. 

Recall that in Neuberger's test run with SGI O2000, at n — 32, the double 
pass is faster than the single pass by 30% [5]. This is not a surprise at all, in 
view of similar speedups of other systems at n = 32. For example, for IBM 
SP2 SMP or DEC alpha XP1000, sustituting the values of t a , £&, t c and td 
(from Tabled into we find that S = T x /T 2 - 1 ~ 30% at n = 32, which 
also agrees with the actual measurements given in the next section (see Table 
HJ). Thus, the speedup S of the double pass for n > tit with nr quite smaller 
than rip is a generic feature of any platform, stemming from the fact that the 
vector operations in the second pass is more efficient than those Ji^| )-(f?3j ) in 
the single pass (i.e., t a > td and tb > td). 

Nevertheless, the salient feature of and is that the number of 
floating point operations and the CPU time for the double pass are almost 
independent of n. Thus one can choose n as large as one wishes, with only 
a negligible overhead. For example, for the 16 3 x 32 lattice, with Lj = 1000, 
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n ev = 20, and q = 0.95, the increment of Ti from n = 16 to n = 200 is less 
than 0.05%. In other words, one can approximate (i/^,)" 1 / 2 !" (i.e., preserve 
the chiral symmetry) to any precision as ones wishes, without noticeably extra 
costs. This is the virtue of Neuberger's double pass algorithm, which may have 
been overlooked in the last five years. 



In this section, we perform several tests on the single and the double pass 
algorithms, and compare the theoretical thresholds ut (j2Uj) and np (|2~5j) with 
the measured values. 

In Table HH we list the number of floating point operations and the CPU 
time for computing one column of the inverse of 



D{m q )D\m q )Y = [m 2 q + (m 2 - m 2 q /4)[2 + (75 ± 1)S(H w )]}y = 1 (35) 



with multi-mass (outer) conjugate gradient for a set of 16 bare quark masses 
(0.02 < m q < 0.3), while the inner CG (jSJ is iterated with the single pass 
and the double pass respectively. The tests are performed on the 8 3 x 24 
lattice with SU(3) gauge configuration generated by the Wilson gauge action 
at /3 — 5.8. Other parameters are: mo = 1.30, n ev = 32 (the number of 
projected eigenmodes), \ ma x/^min — 6.207/0.198 (after projection), and the 
tolerances for the outer and inner CG loops are 1.0 x 10~ n and 2.0 x 10~ 12 
respectively. The total number of iterations L Q for the outer CG loop is around 
100 ~ 103, while the average number of iterations for the inner CG loop is 



With the formulas (|19 p -()22j h we can estimate the number of floating point 
operations and the CPU time for computing one column of D~ 1 (m q ) for a 
number n q of bare quark masses. For the number of floating point operations, 
our results are 

G k = (L + n q )F k + 

N site (60L o n q + 8AL a + 66n q ) + 16L n q - 13L Q + 18n q + 2 (36) 

where L is the number of iterations of the outer CG loop (J35|) . the subscript 
k = 1(2) stands for the single (double) pass respectively. Obviously, the most 
significant part of G k is the first term on the r.h.s. of (|36jh thus 



4 Tests 




m q + (m - m ff /2)(l + j 5 S(H w )) 



i.e., D l {m q ) = D(m q yY, where Y is solved from 



287. 



G k ~ (L + n q )F k , k = 1, 2 . 



(37) 
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Table 3: The number of floating point operations and the CPU time (in unit 
of second) for Pentium 4 (2.53 GHz) with one Gbyte Rambus (PC1066) to 
compute one column of D~ l (m q ) for 16 quark masses, versus the degree n of 
the rational polynomial _R( n_1 > n ) in polar approximation (J2J). 





Double Pass 




Sin 


gle Pass 








# of FPO 


CPU Time(s) 


# of FPO 


CPU Time(s) 


o 




n 


G 2 


V 2 measured 


Gi 


Vi 


measured 


polar 




12 


2.90 x 10 12 


2456 


2451 


1.68 x 10 12 


2342 


2241 


6 x 10" 


-5 


13 


2.90 x 10 12 


2456 


2452 


1.71 x 10 12 


2429 


2372 


3 x 10" 


-5 


14 


2.90 x 10 12 


2456 


2454 


1.75 x 10 12 


2515 


2520 


1 x 10- 


-5 


16 


2.90 x 10 12 


2456 


2454 


1.81 x 10 12 


2689 


2714 


3 x 10- 


-6 


32 


2.90 x 10 12 


2458 


2456 


2.25 x 10 12 


4097 


4089 


3 x 10- 


11 


34 


2.90 x 10 12 


2458 


2458 


2.30 x 10 12 


4273 


4278 


7 x 10- 


12 


40 


2.90 x 10 12 


2458 


2456 


2.45 x 10 12 


4803 


4819 


1 x 10- 


13 


56 


2.90 x 10 12 


2460 


2460 


2.86 x 10 12 


6218 


6261 


2 x 10- 


14 


59 


2.90 x 10 12 


2460 


2460 


2.93 x 10 12 


6483 


6491 


2 x 10- 


14 


60 


2.90 x 10 12 


2460 


2461 


2.96 x 10 12 


6572 


6604 


2 x 10- 


11 


64 


2.90 x 10 12 


2460 


2461 


3.06 x 10 12 


6926 


6965 


2 x 10- 


11 



Similarly, the most significant part of the CPU time is 

V k ~ (L + n q )T k , k = 1, 2 . (38) 

where T\ and T 2 are given in (|2T |l -(|22 |) . 

In Table 01 the estimated CPU times V\ and V 2 are in good agreement with 
the measured CPU times (the deviation is always less than 5%). By comparing 
the CPU times for the single pass and the double pass, we see that the double 
pass becomes faster than the single pass at n ~ 13, in agreement with the 
theoretical estimate (J33j) for p = 0.48, where p is obtained by measuring the 
effective number of the (n — 1) vector pairs {P®, Z^ l \ I — 2, • • • , n} which are 
updated before converges. 

Further, comparing G 2 and G%, we see that G\ ~ G 2 at uf — 59, in 
agreement with the theoretical estimate (|26J) for p = 0.48. 

Also, in Table the remarkable feature of the double pass algorithm is 
demonstrated: the number of floating point operations {G 2 ) and the CPU time 
are almost independent ofn. Thus n can be increased to 64 or any higher value 
such that the chiral symmetry is preserved to any precision as one wishes. The 
chiral symmetry breaking or the error of the rational approximation R( n ~ 1 > n ) 
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Table 4: The CPU time (in unit of second) for the single and the double pass 
algorithms to compute one column of D~ 1 (m q ) for 16 quark masses, versus the 
degree n of the rational polynomial i?( n_1 ' ra ) in polar approximation 



n 


P4 PC800 


IBM SP2 SMP 


DEC alph 


a XP1000 




2-pass 


1-pass 


2-pass 


1-pass 


2-pass 


1-pass 


20 


4922 


4627 


7701 


7674 


9921 


9868 


21 


4930 


4794 


7711 


7881 


9924 


10197 


22 


4918 


4940 


7710 


8090 


9931 


10531 


24 


4921 


5166 


7705 


8529 


9929 


11125 


26 


4920 


5433 


7710 


8990 


9929 


11599 


32 


4918 


6167 


7718 


10138 


9926 


13043 



due to a finite n can be measured by 

WW , 



a = max 

Y 



yty 



W = S{H W )Y 



(39) 



which is shown in the last column of Table El 

To check the theoretical estimates for the threshold tit in (}3"4*j) . we repeat 
the tests of Table El for Pentium 4 (PC800), IBM SP2 SMP, and DEC alpha 
XP1000 respectively The results are listed in Table EJ Obviously, in each 
case, the double pass is faster than the single pass for n > 20 ~ 22, in good 
agreement with the theoretical estimates in ([34)1 . Further, at n = 32, the speed 
of the double pass is faster than the single pass by 25%, 31%, and 31% for 
these three platforms respectively, compatible with what Neuberger found in 
his test run with SGI O2000 [Sj. Note that for Pentium 4, using SSE2 code 
increases the speedup of the double pass to 66% at n = 32 (see Table thus 
makes the double pass alogorithm even more favorable for P4 clusters. 

At this point, it may be interesting to repeat the tests of Table El but 
replacing the polar approximation (j2J) with the Zolotarev optimal rational ap- 
proximation, 



S opt {H w ) = h w ^2 , 2 7 j 

1=1 u w c 2l-l 



TT r>(n-l,n) 



2 ) 

WJ ) 



hy] H w /\ m i n , (40) 



where 



R 



{n-l,n) (U 2 , _ d'o nr=l 1 ( 1 + Kol c 2l) 



E 



^min IL=l(l + h>w/ C 2l-l) ^min i=1 + c' 2 i_ 1 



(41) 



and the coefficients d' Q , b\ and c[ are expressed in terms of Jacobian elliptic 
functions |B1 13 Ej with arguments depending only on n and A^ a2 ,/A^ in (\ max 
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Table 5: The number of floating point operations and the CPU time (in unit 
of second) for Pentium 4 (2.53 GHz) with one Gbyte Rambus (PC1066) to 
compute one column of D~ l (m q ) for 16 quark masses, versus the degree n of 
the Zolotarev rational polynomial . 





Double Pass 


Sin 


sde Pass 












# of FPO 


CPU Time(s) 


# of FPO 


CPU Time(s) 






o 




n 


G 2 


v 2 


measured 


Gt 


Vi 


measured 


Zolotarev 


12 


2.90 x 10 12 


2456 


2450 


1.72 x 10 12 


2309 


2274 


7 


X 


10" 


-ii 


13 


2.90 x 10 12 


2456 


2452 


1.75 x 10 12 


2398 


2404 


8 


X 


10" 


-12 


14 


2.90 x 10 12 


2456 


2455 


1.78 x 10 12 


2485 


2463 


1 


X 


io- 


-12 


16 


2.90 x 10 12 


2456 


2455 


1.83 x 10 12 


2659 


2638 


3 


X 


io- 


-14 


32 


2.90 x 10 12 


2458 


2458 


2.25 x 10 12 


4058 


4068 


3 


X 


io- 


-14 


34 


2.90 x 10 12 


2458 


2458 


2.30 x 10 12 


4233 


4245 


3 


X 


io- 


-14 


40 


2.90 x 10 12 


2458 


2460 


2.45 x 10 12 


4759 


4795 


3 


X 


io- 


-14 


56 


2.90 x 10 12 


2460 


2462 


2.86 x 10 12 


6159 


6180 


3 


X 


io- 


-14 


59 


2.90 x 10 12 


2460 


2462 


2.92 x 10 12 


6423 


6459 


3 


X 


io- 


-14 


60 


2.90 x 10 12 


2460 


2460 


2.95 x 10 12 


6510 


6544 


3 


X 


io- 


-14 


64 


2.90 x 10 12 


2460 


2462 


3.05 x 10 12 


6860 


6903 


3 


X 


io- 


-14 



and X m in are the maximum and the minimum of the the eigenvalues of | H w | ) . 
The results are listed in Table El 

Comparing Table HU with Table it is clear that for the single pass with 
n < 32, Zolotarev optimal approximation is better than the polar approxi- 
mation, in terms of the precision of the approximation (a). However, for the 
double pass, the polar approximation seems to be as good as the Zolotarev 
approximation since the degree n can be pushed to a very large value, with 
negligible extra CPU time. In other words, with the double pass algorithm, it 
does not matter which rational approximation one uses to compute D~ l (m q ) 
in a gauge background. This seems to be a rather unexpected result. 

5 Concluding remarks 

So far, we have restricted our discussions to the sign function with argument 
H w . However, it is clear that the salient features of the double pass algorithm 
are invariant for other choices of the argument, e.g., improved Wilson operator. 
In general, the double pass algorithm is a powerful scheme for the matrix- vector 
product R(H 2 ) ■ Y, where R can be any rational polynomial R with argument 
H 2 (positive definite Hermitian operator), not just for (if 2 ) -1 / 2 . 
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The virtue of Neuberger's double pass algorithm is its constancy in speed 
and memory storage for any degree n of the rational approximation, where its 
constancy in speed is valid under a mild condition (N site ^> Lj) which can be 
fulfilled in most cases. Further, the double pass is faster than the single pass 
even for n as small as 12 (Pentium 4), and it is expected that the threshold 
in? — 12 — 25 for most systems. Thus, it seems that there is not much room 
left for the single pass algorithm with Zolotarev approximation, unless the 
number of inner CG iterations is exceptionally large, which could happen if 
the low-lying eigenmodes of are not projected out and treated exactly. 

Note that can be tridiagonalized by the conjugate gradient (|5j)- (jTTl) . 
with the unitary transformation matrix U formed by the normalized residue 
vectors {Rj,j = 0, ■ • ■ , i}, and the elements of the tridiagonal matrix expressed 
in terms of the coefficients {atj, j3j,j = 0, ■ • ■ , i} (up to the tolerance of the 
conjugate gradient), i.e., 

U*H*U ~ T (42) 

where 

U kj = . {Rj)k , (43) 
and T is a symmetric tridiagonal matrix with nonzero elements 

T " = ^ + ^' ^ 

T 3+i,i = T j,j+i = ^ , j = 0, • • • , i. (45) 

a j 

Thus, after running the first pass of the CG loop ([fi|)- ([TT|l . T can be constructed 
from the coefficients {atj,[3j}, and diagonalized by an orthogonal transforma- 
tion 

T = OAO . (46) 
Then the matrix-vector product (if^) _1 / 2 y can be evaluated as 

' Y ~ UO^=OU ] Y = h R i ( 47 ) 

3=0 



IHI VA 

where 



h ~ O jm —=0 Qm « j °| °| . (45 

*m \ \ n j\ n jl 



m=0 
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Here the summation on the r.h.s. of (|4"7j) is obtained by running the second pass 
of the CG loop {©,©,©, (EH)}; an d adding IjRj to the sum successively 
from j — to i. 

It is well known that (any positive definite Hermitian matrix) can be 
tridiagonalized by Lanczos iteration [UJ HO] as well as the conjugate gradient. 
The connection between the Lanczos iteration and the conjugate gradient for 
the tridiagonalization of a positive definite Hermitian matrix has been well 
established [Oj, and both have almost the same performance in practice. In 
Ref. [T^, the Lanczos approach was proposed for the matrix- vector product 
(iJ^) _1 / 2 F, and its variant (replacing Lanczos iteration with the conjugate 
gradient) was used in Ref. [P2j . 

The only difference between the Lanczos (CG) algorithm and Neuberger's 
double pass algorithm is the diagonalization of the tridiagonal matrix T and 
the computation of the coefficients {/.,•} (I48|) in the former versus the compu- 
tation of the coefficients {cj} (|17|) in the latter. Since the number of floating 
point operations for the diagonalization of a symmetric tridiagonal matrix T is 
~ 3Lf (where Lj is the number of iterations of the inner CG loop, or the size of 
T), it is compatible with that of computing the coefficients {cj}, i.e., the last 
term on the r.h.s. of (J2U)) . Thus we expect that the performance (speed and 
accuracy) of Lanczos (CG) algorithm and Neuberger's double pass algorithm 
are compatible. 

In Table El we compare the Lanczos (CG) algorithm with Neuberger's 
double pass algorithm, by computing one column of D^ 1 (m q ) (for 16 bare 
quark masses) on the 16 3 x 32 lattice with SU(3) gauge configuration generated 
by the Wilson gauge action at (3 — 6.0. Other parameters are: m = 1.30, 
n ev = 20 (the number of projected eigenmodes), Xmax/^min = 6.260/0.215 
(after projection), and the tolerances for the outer and inner CG (Lanczos) 
loops are 1.0 x 10 -11 and 2.0 x 10~ 12 respectively. The number of iterations 
for the outer CG loop is L = 347, while the average number of iterations for 
the inner CG loop is ~ 300. Evidently, these seemingly different algorithms 
have almost the same speed as well as the accuracy (a). 

Thus, for quenched lattice QCD, one has several compatible options to 
compute the quenched quark propagator 

(Dc + mq)- 1 = (l-rm q y 1 lD- 1 (m q )-r] , r = — - , (49) 

2777-0 

even though we have chosen Neuberger's double pass algorithm to solve D~ 1 (m q ) 
in our recent investigation [T3*j . Nevertheless, for lattice QCD with dynamical 
quarks, the quark determinant detD(m q ) could not be computed directly with 
existing algorithms and computers. If detD(m q ) is incorporated through the 
dynamics of 277 pseudofermion fields (where n can be regarded as the degree 
77 in the rational polynomial i?( n-1,n )), then an additional degree of freedom 
(or the fifth dimension with iV s = 2n sites) has to be introduced. Thus a rele- 
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Table 6: The number of floating point operations and the CPU time (in unit 
of second) for Pentium 4 (2.53 GHz) with one Gbyte Rambus (PC1066) to 
compute one column of D _1 (m g ), versus different algorithms. 





Double pass algorithm 


Lanczos (CG) algorithm 




Polar(n = 128) 


Zolotarev(n = 16) 


Lanczos CG 


FPO 


9.49 x 10 13 


9.49 x 10 13 


9.54 x 10 13 9.51 x 10 13 


Time (total) 


94543 


94632 


97824 94722 


Time (2nd pass) 


46281 


46303 


46353 46174 


o 


1 x 10" 14 


1 x icr 14 


1 x 10" 14 1 x icr 14 



vant question is how to reproduce detD(m q ) accurately with the minimal N s . 
A solution has been presented in Ref. [Hj. On the other hand, it would be 
interesting to see whether there is an algorithm to drive the dynamics of these 
N s pseudofermion fields such that the cost is almost independent of N s = 2n. 
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